GATEtoGerManC: A GATE-based Annotation Pipeline for Historical German

Authors

  • Silke Scheible
  • Richard J. Whitt
  • Martin Durrell
  • Paul Bennett

Abstract

We describe a new GATE-based linguistic annotation pipeline for Early Modern German, which can be used to annotate historical texts with word tokens, sentence boundaries, lemmas, and POS tags. The pipeline is based on a customisation of the freely available ANNIE system for English (Cunningham et al., 2002), in combination with a version of the TreeTagger (Schmid, 1994) trained on gold standard Early Modern German data. The POS-tagging and lemmatisation components of the pipeline achieve an average accuracy of 89.44% and 83.16%, respectively, on unseen historical data from various genres and publication dates within the Early Modern period. We show that normalisation of spelling variation can further improve these results. With no specialised tools available for processing this particular stage of the language, this pipeline will be of particular interest to smaller, humanities-based projects wishing to add linguistic annotations to their historical data but which lack the means or resources to develop such tools themselves.

Similar resources

Annotation in Architecture: A Systematic Approach toward Mobilization and Development of Theoretical, Research, and Critical Basis in Architecture

Annotations usually refer to marginal notes that explain a difficult or ambiguous subject, provide a general definition or a critical remark for a particular part of a text. Historically, annotating was a well-known tradition in Islamic sciences and was used especially in times when there were less new potentials for generating new knowledge. The main question of this research is, can the tradi...

Implementation and Evaluation of a Negation Tagger in a Pipeline-based System for Information Extraction from Pathology Reports

We have developed a pipeline-based system for automated annotation of Surgical Pathology Reports with UMLS terms that builds on GATE--an open-source architecture for language engineering. The system includes a module for detecting and annotating negated concepts, which implements the NegEx algorithm--an algorithm originally described for use in discharge summaries and radiology reports. We desc...

A Gold Standard Corpus of Early Modern German

This paper describes an annotated gold standard sample corpus of Early Modern German containing over 50,000 tokens of text manually annotated with POS tags, lemmas, and normalised spelling variants. The corpus is the first resource of its kind for this variant of German, and represents an ideal test bed for evaluating and adapting existing NLP tools on historical data. We describe the corpus fo...

Novel Design of n-bit Controllable Inverter by Quantum-dot Cellular Automata

Application of quantum-dot is a promising technology for implementing digital systems at nano-scale. Quantum-dot Cellular Automata (QCA) is a system with low power consumption and a potentially high density and regularity. Also, QCA supports the new devices with nanotechnology architecture. This technique works ...

Optimized Design of Multiplexor by Quantum-dot Cellular Automata

Quantum-dot Cellular Automata (QCA) has low power consumption and high density and regularity. QCA widely supports the new devices designed for nanotechnology. Application of QCA technology as an alternative method for CMOS technology on nano-scale shows a promising future. This paper presents successful designing, layout and analysis of Multiplexer with a new structure in QCA technique. In thi...

Publication date: 2012